Abstract 

We introduce a new supervised algorithm for image classification with rejection using multiscale contextual 
information. Rejection is desired in image-classification applications that require a robust classifier but not the 
classification of the entire image. The proposed algorithm combines local and multiscale contextual information 
with rejection, improving the classification performance. 

As a probabilistic model for classification, we adopt a multinomial logistic regression. The concept of rejection 
with contextual information is implemented by modeling the classification problem as an energy minimization 
problem over a graph representing local and multiscale similarities of the image. The rejection is introduced 
through an energy data term associated with the classification risk and the contextual information through an 
energy smoothness term associated with the local and multiscale similarities within the image. We illustrate the 
proposed method on the classification of images of H&E-stained teratoma tissues. 

Index Terms 


classification with rejection, histopathology 


I. Introduction 

In many classification problems, the cost of creating a training set that is statistically representative 
of the input dataset is often high. This is due to the required size of the training set and to the difficulty 
of obtaining a correct labeling, which results from unclear class separability and the possible presence 
of unknown classes. In this work, we were motivated by the need for automated tissue identification 
(classification) in images from Hematoxylin and Eosin (H&E) stained histopathological slides. 
H&E staining is used both for diagnosis and to gain a better understanding of diseases and their 
processes; it consists of the sequential staining of a tissue with two different stains that have different 
affinities to different tissue components. 

In this paper, we are interested in a subclass of image classification problems with the following 
characteristics: 

• The classification is not directly based on the observation of pixel values but on higher-level features; 

• The characteristics of the image make it impossible to have access to pixelwise ground truth, leading 
to small, unbalanced, noisy, or incomplete training sets; 

• The pixels may belong to unknown classes; 

• The classification accuracy at pixels belonging to interesting or known classes is more important than 
the classification accuracy at pixels belonging to uninteresting or unknown classes; 

• The need for high accuracy surpasses the need to classify all the samples. 

A. Goal 

In problems such as the above, introducing a rejection option yields improvements in the classification 
performance (classification with rejection). Further improvements in accuracy can be obtained by exploiting 
spatial and multilevel similarities (classification using contextual information). Our goal is to combine 
classification with rejection and classification using contextual information in an image classification 
framework to obtain improved classification performance. 


B. Classification with Rejection 

A classifier with rejection can be seen as a coupling of two classifiers: (1) a general classifier that 
classifies a sample and (2) a binary classifier that, based on the information available as input and output 
of the first classifier, decides whether the classification performed by the first classifier was correct or 
incorrect. As a result, we are able to classify according to the general classifier, or reject if the decision 
of the binary classifier is that the former classification is incorrect. 

A classifier with rejection allows for coping with unknown information and reducing the effect of 
nonideal training sets. It was first analyzed in Q, where Chow’s rule for optimum error-reject trade-off 
was presented. Based on the posterior probabilities of the classes given the features for the classification, 
Chow’s rule allows for the determination of a threshold for rejection, such that the classification risk is 
minimized. The authors in [|^ point out that Chow’s rule only provides the optimal error-reject threshold 
if these posterior probabilities are exactly known. They propose the combination of class-related reject 
thresholds to improve the error-reject trade-off. Parameters are selected using the constrained maximization 
of the accuracy subject to upper bounds on the rejection rate as a performance metric. In Q, the authors 
present a mathematical framework for binary classification with rejection. In that approach, the rejection 
is based on risk minimization and the cost for each different binary classification error considered. 

Usually, the rejection is applied as a plug-in rule to the outputs of a classifier. It is also possible, 
however, to combine the output of multiple classifiers (multiple general classifiers) to create rejection. In 
[j^, the authors present a multi-expert system based on a Bayesian combination rule. The reliability of 
the classification is estimated from the posterior probabilities of the two most probable classes, and the 
rejection works by thresholding the reliabilities. 

Another approach is to include the rejection in the classifier itself as an embedded rejection instead of a 
plug-in rule. In [[^, the rejection is embedded in a Support Vector Machine (SVM), in which the rejection 
is present in the training phase of the SVM and included in the formulation in close association with the 
separating hyperplane resulting from the SVM. This leads to a nonconvex optimization problem that can 
be approximately solved by finding a surrogate loss function. In p0| and pl] |, the statistical properties 
of a surrogate loss function are studied and applied to the task of rejection by risk minimization. In [12], 
the use of a LASSO-type penalty for risk minimization is analyzed. 

Yet another approach consists in having a second classifier with access to the input and output of 
the first classifier instead of a plug-in rule or an embedded rejection. In [13], the second classifier is 
trained with the main classifier to assess the reliability of the main classifier. The rejection is based on 
thresholding the reliability provided by the second classifier. 

More recently, in p^ , the authors present a framework for the multilabel classification problem with 
rejection. A trade-off between the accuracy of the nonrejected samples and the rejection cost is found as 
a result of a constrained optimization problem. Furthermore, an application-specific reliability measure of 
the classification with rejection inspired by the F-score (weighted harmonic mean of precision and recall) 
is defined. 

In the present work, we propose a classification system with rejection using contextual information. To 
assess the performance of the method, in addition to the fraction of rejected samples r and the classification 
accuracy on the subset of nonrejected samples A, we use the concept of classification quality Q and 
rejection quality [15]. The classification quality can be defined as the accuracy of a binary classifier that 
aims to classify correctly classified samples as nonrejected and incorrectly classified samples as rejected. 
Maximizing the classification quality leads both to keeping correctly classified samples and rejecting 
incorrectly classified samples. The classification quality allows us to compare different classifiers with 
different rejection ratios and accuracies. The rejection quality can be defined as the positive likelihood 
ratio of a binary classifier that aims to classify correctly classified samples as nonrejected and incorrectly 
classified samples as rejected. It compares the proportion of correctly classified to incorrectly classified 
samples in the set of rejected samples to the proportion on the entire data. The rejection quality provides 
insight into the ability of a classifier with rejection to concentrate incorrectly classified samples in the set 
of rejected samples. 
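
The performance measures described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the evaluation code used in this work: the function name `rejection_metrics` and its arguments are hypothetical, and the rejection quality is computed as the positive likelihood ratio with "rejected" taken as the positive outcome, which is one reading of the definition given in the text.

```python
import numpy as np

def rejection_metrics(y_true, y_pred, rejected):
    """Return (r, A, Q, phi) for a classifier with rejection.

    y_pred holds the labels of the underlying classifier for ALL samples;
    rejected is a boolean mask marking the samples that were rejected.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rejected = np.asarray(rejected, dtype=bool)
    correct = y_true == y_pred
    kept = ~rejected

    r = rejected.mean()                                  # fraction of rejected samples
    A = correct[kept].mean() if kept.any() else np.nan   # accuracy on nonrejected samples
    # Q: correct samples that were kept plus incorrect samples that were rejected.
    Q = (np.sum(correct & kept) + np.sum(~correct & rejected)) / correct.size

    # Assumed form of the rejection quality: positive likelihood ratio,
    # (rejected | incorrect) rate divided by (rejected | correct) rate.
    n_wrong, n_right = np.sum(~correct), np.sum(correct)
    wrong_rej = np.sum(~correct & rejected)
    right_rej = np.sum(correct & rejected)
    if n_wrong == 0 or n_right == 0 or right_rej == 0:
        phi = np.nan  # left undefined in this sketch
    else:
        phi = (wrong_rej / n_wrong) / (right_rej / n_right)
    return r, A, Q, phi
```

A rejection quality above 1 indicates that incorrectly classified samples are over-represented among the rejected samples relative to the entire dataset.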


C. Classification with Contextual Information 

The basic assumption for classification with contextual information is that the data is not spatially 
independent: in most real-world data, two neighboring pixels are likely to belong to the same class. 
This assumption can be extended to include multiple definitions of a neighborhood: local, nonlocal, and 
multiscale. 

The use of contextual information is prevalent in tasks in which the spatial dependencies play an 
important role, such as image segmentation and image reconstruction [16]. In [17], the authors formulate 
a discriminative framework for image classification taking into account spatial dependencies. This framework 
allows both the use of discriminative probabilistic models and adaptive spatial dependencies. 

For the purposes of our application, we can learn from hyperspectral image classification, where the use 
of contextual information is prevalent p^ , [jT^. We model classification with contextual information 
as a Discriminative Random Field (DRF) with the association potential linked with the pixelwise 
class posterior probabilities and the interaction potential linked with a multilevel logistic (MLL) Markov 
random field (MRF) [20] endowed with a neighborhood system associated with a multi-scale similarity 
graph. This MLL-MRF promotes segmentations in which neighboring samples are likely to belong to the 
same class at multiple scales, leading to multi-scale spatial consistency among the classifications. 


D. Classification with Rejection Using Contextual Information 


[Figure 1: block diagram with three blocks: similarity analysis, expert classification, and contextual rejection.] 

Fig. 1. Classification with rejection using contextual information. Each gray block is discussed in a separate section: similarity analysis in 
Section III, expert classification in Section IV, and contextual rejection in Section V. 


The proposed framework, shown in Fig. 1, combines classification with rejection and classification 
with contextual information. Our approach allows not only for rejecting a sample when the information 
is insufficient to classify it, but also for not rejecting a sample when an "educated guess" is possible based 
on neighboring labels (local and nonlocal from the spatial point of view). We do so by transforming the 
soft classification (posterior distributions) obtained by an expert classifier into a hard classification (labels) 
that considers both rejection and contextual information. 

An expert classifier is designed based on application-specific features, and a similarity graph is 
constructed to represent the underlying multiscale structure of the data. The classification risk from the expert 
classifier is computed and the rejection is introduced as a simple classification risk threshold rule in an 
extended risk formulation. This formulation consists in a maximum a posteriori (MAP) inference problem 
defined on the similarity graph, thus combining rejection and contextual information. 

Compared with classification with rejection only, our approach has an extra degree of complexity: the 
rejection depends not only on a rejection threshold for the classification but also on a rejection consistency 
parameter. By imposing a higher rejection consistency, the rejected samples become rejection areas (that 
is, a nonrejected sample surrounded by rejected samples will tend to be rejected too), which is meaningful 
in the task of image classification. 

Compared with classification with contextual information only, this problem is of the same complexity, 
as the rejection can be treated as a class, and class-specific transitions can be easily modeled. 


E. Outline of the Paper 

In Section II, we describe the background for our framework: partitioning, feature extraction, and 
classification. In Section III, we explore the similarity analysis block of the framework and the design 
of a multilevel similarity graph that represents the underlying structure of the data. In Section IV, we 
describe the elements of the expert classification block of the framework not covered in the background. 
We introduce the rejection as a mechanism for handling the inability of the classifier to correctly classify 
all the samples. In Section V, we combine the expert classification and the multiscale similarity graph in 
an energy minimization formulation to obtain classification with rejection using contextual information. 
In Section VI, we apply our framework to the classification of real data: natural images and H&E-stained 
teratoma tissue images. Finally, Section VII concludes the paper. 


II. Background 

We now describe the background for our work in terms of image partitioning, features, classification, 
and methods used to compute the MAP solution. 

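Let $\mathcal{S} = \{1, \dots, s\}$ denote the set of pixel locations, $\mathbf{z}_i$ denote an observed vector at pixel 
$i \in \mathcal{S}$, $I = [\mathbf{z}_1, \mathbf{z}_2, \dots, \mathbf{z}_s]$ denote an observed image, $\mathcal{P} = \{x_1, \dots, x_n\}$ denote a partition of $\mathcal{S}$, 
$\mathcal{V} = \{1, \dots, n\}$ denote a set indexing the elements of the partition $\mathcal{P}$, termed superpixels, and $\mathcal{E} \subset \mathcal{V} \times \mathcal{V}$ 
denote a set indexing pairs of neighboring superpixels. Given that $\mathcal{P}$ is a partition of $\mathcal{S}$, then $x_i \subset \mathcal{S}$ for 
$i \in \mathcal{V}$, $x_i \cap x_j = \emptyset$ for $i \neq j \in \mathcal{V}$, and $\bigcup_{i=1}^{n} x_i = \mathcal{S}$. 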


A. Partitioning 

To decrease the dimensionality of the problem, and thus the computational burden, we partition the 
set of pixel locations S into a partition P, allowing for the efficient use of graph-based methods. The 
partitioning of the image is performed by oversegmentation creating superpixels as described in . This 
method, as is typical in most segmentation techniques, aims at maintaining a high level of similarity inside 
each superpixel and high dissimilarity between different superpixels. 

Because of how the superpixels are created (measuring the evidence of a boundary between two regions), 
there is a high degree of inner similarity in each partition element; the elements of a superpixel will very 
likely belong to the same class. The major drawback of using this partitioning method is that the partition 
elements are highly nonuniform in terms of size and shape. 


B. Features 

We use two kinds of features: (1) application-specific features encode expert knowledge and are used 
to classify each partition element, and (2) generic similarity features represent low-level similarities of 
the image and are used to assess the similarity among the partition elements. From each partition element 
Xj, we extract statistics of the application-specific features and of the similarity features (from all pixels 
belonging to the same partition element), mapping from features defined on an image pixel space to 
features defined on an image partition space. 

C. Classification 


Given the partition $\mathcal{P}$ and the associated feature matrix $\mathbf{F} = [\mathbf{f}_1, \dots, \mathbf{f}_n]$, with $\mathbf{f}_i \in \mathbb{R}^m$ the $m$-dimensional 
application-specific features, we wish to classify each partition element $x_i \in \mathcal{P}$ into a single class. We do 
so by assigning to it a label $y_i \in \mathcal{L} = \{1, \dots, N\}$ representative of its class. This assignment is performed 
by maximizing the posterior distribution $p(\mathbf{y}|\mathbf{F})$ with respect to $\mathbf{y} = [y_1, \dots, y_n]$, that is, by computing 
the MAP labeling 

$$\hat{\mathbf{y}} \in \arg\max_{\mathbf{y} \in \mathcal{L}^n} p(\mathbf{y}|\mathbf{F}). \tag{1}$$


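We note that under the assumption of conditional independence of features given the labels, $p(\mathbf{F}|\mathbf{y}) = 
\prod_{i \in \mathcal{S}} p(\mathbf{f}_i|y_i)$, and of equiprobable class probabilities, $p(y_i) = p(y_j)$ for all $i, j \in \mathcal{S}$, we can reformulate 
the MAP formulation in (1) as 

$$\hat{\mathbf{y}} \in \arg\min_{\mathbf{y} \in \mathcal{L}^n} \sum_{i \in \mathcal{S}} -\log p(y_i|\mathbf{f}_i) - \log p(\mathbf{y}). \tag{2}$$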


For the posterior p(y|F) we adopt the DRF model pT|, 


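$$p(\mathbf{y}|\mathbf{F}) \propto \exp\Big( -(1-\alpha) \sum_{i \in \mathcal{V}} D(y_i, \mathbf{f}_i) - \alpha \sum_{\{i,j\} \in \mathcal{E}} V_{\{i,j\}}(y_i, y_j) \Big), \tag{3}$$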


where $-D(y_i, \mathbf{f}_i)$ is the association potential, which links discriminatively the label $y_i$ with the feature 
vector $\mathbf{f}_i$, $-V_{\{i,j\}}(y_i, y_j)$ is the interaction potential, which models the spatial contextual information, and 
$\alpha \in [0,1]$ is a regularization parameter that controls the relative weight of the two potentials. The posterior 
(3) is a particular case of the DRF class introduced in pT| , because the association potential does not 
depend on the partition elements. The DRF model used constitutes an excellent trade-off between model 
complexity and goodness of the inferences, as shown in Section VI. 


To completely define (3), we need to specify the association potential $-D$ and the interaction potential 
$-V_{\{i,j\}}$. In this work, we start from the assumption that $-D(y_i, \mathbf{f}_i) = \log p(y_i|\mathbf{f}_i, \mathbf{W})$, resulting from (2) 
and (3), where $p(y_i|\mathbf{f}_i, \mathbf{W})$ is the multinomial logistic regression (MLR) [21] parameterized with the matrix 
of regression coefficients $\mathbf{W}$, and that $-V_{\{i,j\}}(y_i, y_j) = w_{ij}\,\delta_{y_i, y_j}$, where $w_{ij} > 0$ is a weight to be defined later 
and $\delta_{ij}$ is the Kronecker symbol (i.e., $\delta_{ij} = 1$ if $i = j$ and $\delta_{ij} = 0$ if $i \neq j$). This class of interaction 
potentials, which defines an MLL-MRF prior [20], promotes neighboring labels of the same class. In the 
following subsection we address the learning of the MLR regression matrix $\mathbf{W}$ in detail. 

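1) Multinomial Logistic Regression: Let $\mathbf{k}(\mathbf{f}) = [k_0(\mathbf{f}), \dots, k_q(\mathbf{f})]^T$ denote a vector of nonlinear 
functions $k_i : \mathbb{R}^m \to \mathbb{R}$, for $i = 0, \dots, q$, with $q$ the number of training samples and with $k_0 = 1$. 
The MLR models the a posteriori probability of $y_i \in \mathcal{L}$ given $\mathbf{f} \in \mathbb{R}^m$ as 

$$p(y_i = l \,|\, \mathbf{f}, \mathbf{W}) = \frac{e^{\mathbf{w}_l^T \mathbf{k}(\mathbf{f})}}{\sum_{j=1}^{N} e^{\mathbf{w}_j^T \mathbf{k}(\mathbf{f})}}, \tag{4}$$

where $\mathbf{W} = [\mathbf{w}_1, \dots, \mathbf{w}_N] \in \mathbb{R}^{(q+1) \times N}$ is the matrix of regression coefficients. Given that $p(y_i|\mathbf{f}, \mathbf{W})$ is 
invariant with respect to a common translation of the columns of $\mathbf{W}$, we arbitrarily set $\mathbf{w}_N = \mathbf{0}$. 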
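
As an illustration of the MLR posterior in (4), the following minimal sketch evaluates it for a batch of samples. The function name `mlr_posterior` and the array layout (columns of `K` holding the vectors k(f_i)) are assumptions made for this example; subtracting the row maximum before exponentiating is a standard numerical safeguard that does not change the result.

```python
import numpy as np

def mlr_posterior(K, W):
    """Evaluate the MLR posterior for a batch of samples.

    K: (q+1, n) array whose i-th column is k(f_i), with first row equal to 1.
    W: (q+1, N) array of regression coefficients.
    Returns an (n, N) array; row i holds p(y_i = l | f_i, W) for every class l.
    """
    scores = K.T @ W                               # scores[i, l] = w_l^T k(f_i)
    scores = scores - scores.max(axis=1, keepdims=True)
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)
```

Adding the same vector to every column of `W` shifts all scores of a sample by the same amount and leaves the returned posteriors unchanged, which is the translation invariance that justifies fixing the last column to zero.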

2) Learning the Regression Coefficients W: Our approach is supervised; we can thus split the dataset 
into a training set $\{(y_i, \mathbf{f}_i), i \in \mathcal{T}\}$, where $\mathcal{T} \subset \mathcal{V}$ is a set indexing the labeled superpixels, and the 
set $\{\mathbf{f}_i, i \in \mathcal{V} \setminus \mathcal{T}\}$ containing the remaining unlabeled feature vectors. Based on these two sets and on 
the DRF model (3), we can infer the matrix $\mathbf{W}$ jointly with the MAP labeling $\hat{\mathbf{y}}$. Because it is difficult to 
compute the normalizing constant of $p(\mathbf{y}|\mathbf{F})$, this procedure is complex and computationally expensive. 

Aiming at a lighter procedure to learn the matrix $\mathbf{W}$, we adopt the sparse multinomial logistic regression 
(SMLR) criterion introduced in p^, which, fundamentally, consists in setting $\alpha = 0$ in (3), that is, 
disconnecting the interaction potential, and computing the MAP estimate of $\mathbf{W}$ based on the training set 
and on a Laplacian independent and identically distributed prior for the components of $\mathbf{W}$. We are 
then led to the optimization 

$$\widehat{\mathbf{W}} \in \arg\max_{\mathbf{W}} \; l(\mathbf{W}) + \log p(\mathbf{W}), \tag{5}$$

with $l(\mathbf{W}) = \sum_{i \in \mathcal{T}} \log p(y_i|\mathbf{f}_i, \mathbf{W})$ the log-likelihood, and $p(\mathbf{W}) \propto e^{-\lambda \|\mathbf{W}\|_{1,1}}$ the prior, where $\lambda$ is the 
regularization parameter and $\|\mathbf{W}\|_{1,1}$ denotes the sum of the $\ell_1$ norms of the columns of the matrix $\mathbf{W}$. 
It is well known that the Laplacian prior (the $\ell_1$ regularizer in the regularization framework) promotes 
sparse matrices $\mathbf{W}$, that is, matrices $\mathbf{W}$ with most elements set to zero. The sparsity of $\mathbf{W}$ avoids 
overfitting and thus improves the generalization capability of the MLR, mainly when the size of the 
training set is small. The sparsity level is controlled by the parameter $\lambda$. 

3) LORSAL: We use the logistic regression via variable splitting and augmented Lagrangian (LORSAL) 
algorithm (see [18]) to solve the optimization (5). The algorithm is quite effective from the 
computational point of view, mainly when the dimension of $\mathbf{k}(\mathbf{f})$ is large. 

LORSAL solves the equivalent problem 

$$\min_{\mathbf{W}, \boldsymbol{\Omega}} \; -l(\mathbf{W}) + \lambda \|\boldsymbol{\Omega}\|_{1,1} \quad \text{subject to: } \mathbf{W} = \boldsymbol{\Omega}. \tag{6}$$

The formulation in (6) differs from the one in (5) in the sense that $\log p(\mathbf{W})$ is replaced by $\log p(\boldsymbol{\Omega})$ with 
the constraint $\mathbf{W} = \boldsymbol{\Omega}$ added to the optimization problem, introducing a variable splitting. Note that $-l(\mathbf{W})$ 
is convex but nonquadratic, and $\lambda \|\boldsymbol{\Omega}\|_{1,1}$ is convex but nonsmooth, thus yielding a convex nonsmooth and 
nonquadratic optimization. LORSAL approximates $-l(\mathbf{W})$ by a quadratic upper bound [21], transforming 
the nonsmooth convex minimization (6) into a sequence of $\ell_2$-$\ell_1$ minimization problems solved with the 
alternating direction method of multipliers [23]. 


Given the set of indices $\mathcal{T}$ corresponding to the training samples and the respective training set, 
a radial basis function (RBF) is a possible choice for the functions in the vector of nonlinear 
regression functions $\mathbf{k}$ used in (4), which allows us to obtain a training kernel (computed by an RBF kernel 
of the training data). This allows us to deal with features that are not linearly separable. To normalize 
the values of the nonlinear regression function, the bandwidth of the RBF kernel is set to be the square 
root of the average of the distance matrix between the training and test sets. With both the regressor 
matrix $\mathbf{W}$ and the nonlinear regression function $\mathbf{k}$ defined, we obtain the class probabilities from the 
MLR formulation in (4). 
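
A minimal sketch of such an RBF feature map is given below. Several details are assumptions made for illustration and are not specified in the text: the Gaussian form exp(-d^2 / (2 sigma^2)), the use of squared Euclidean distances in the distance matrix, and the name `rbf_features`. The bandwidth follows the rule stated above (square root of the average of the distance matrix).

```python
import numpy as np

def rbf_features(F_train, F_all):
    """Build the vectors k(f) for every row of F_all.

    F_train: (q, m) training feature vectors (the RBF centres).
    F_all:   (n, m) feature vectors to be mapped.
    Returns K of shape (q+1, n); the first row is the constant k_0 = 1.
    """
    diff = F_all[:, None, :] - F_train[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)           # (n, q) squared distances
    sigma = np.sqrt(sq_dist.mean())               # assumed bandwidth rule
    K = np.exp(-sq_dist / (2.0 * sigma ** 2)).T   # (q, n), assumed Gaussian form
    return np.vstack([np.ones((1, F_all.shape[0])), K])
```

The sketch assumes that not all feature vectors coincide; otherwise the bandwidth would be zero and the kernel undefined.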


D. Computing the MAP Labeling 

From (3), we can write the MAP labeling optimization as 

$$\hat{\mathbf{y}} \in \arg\min_{\mathbf{y} \in \mathcal{L}^n} (1-\alpha) \sum_{i \in \mathcal{V}} D(y_i, \mathbf{f}_i) + \alpha \sum_{\{i,j\} \in \mathcal{E}} V_{\{i,j\}}(y_i, y_j). \tag{7}$$

This is an integer optimization problem, which is NP-hard for most interaction potentials promoting 
piecewise smooth segmentations. A remarkable exception is the binary case (when $N = 2$) with submodular 
interaction potentials, which are the interaction potentials that we consider; in this case the exact labeling 
can be computed in polynomial time by mapping the problem onto a suitable graph and computing a 
min-cut/max-flow on that graph [24]. 

We find an approximate solution to this problem by using the $\alpha$-expansion algorithm [16], [25]. With 
the constraint that $V_{\{i,j\}}$ is metric in the label space, the local minimum found by $\alpha$-expansion is within 
a known factor of the global minimum of the labeling. 
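
To make the objective in (7) concrete, the sketch below evaluates the energy of a labeling with the potentials of Section II-C and minimizes it by exhaustive search on a three-node chain. The exhaustive search is a stand-in chosen for clarity and is only feasible for toy problems; it is not the α-expansion algorithm. The names `labeling_energy` and `brute_force_map` are hypothetical, and labels are 0-based.

```python
from itertools import product

import numpy as np

def labeling_energy(y, D, edges, weights, alpha):
    """Energy of labeling y under (7).

    D[i, l] = -log p(y_i = l | f_i, W); edges: (E, 2) node indices;
    weights: (E,) edge weights w_ij; V(y_i, y_j) = -w_ij if y_i == y_j, else 0.
    """
    y = np.asarray(y)
    data = D[np.arange(y.size), y].sum()
    same = y[edges[:, 0]] == y[edges[:, 1]]
    smooth = -weights[same].sum()
    return (1.0 - alpha) * data + alpha * smooth

def brute_force_map(D, edges, weights, alpha):
    """Exhaustive minimizer of (7); illustration only, exponential cost."""
    n, N = D.shape
    return min(product(range(N), repeat=n),
               key=lambda y: labeling_energy(y, D, edges, weights, alpha))

# Three nodes in a chain; the middle node is weakly assigned to class 1.
P = np.array([[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]])
D = -np.log(P)
edges = np.array([[0, 1], [1, 2]])
weights = np.array([1.0, 1.0])
print(brute_force_map(D, edges, weights, alpha=0.0))  # (0, 1, 0)
print(brute_force_map(D, edges, weights, alpha=0.5))  # (0, 0, 0)
```

With α = 0 each node takes its most probable class; with α = 0.5 the contextual term overrides the weak evidence at the middle node.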

III. Similarity Analysis 

Similarity analysis is the first step (see Fig. 1) of the proposed approach. To represent similarities in 
the image, we construct a similarity multiscale graph by (a) partitioning the image at different scales 
and (b) finding both local and multiscale similarities. The partitioning of the image at each scale is 
computed from the oversegmentation that results from using superpixels [26]. The different scales used 
for partitioning reflect a compromise between the computational cost associated with large multiscale graphs 
and the performance gains achieved by having a multiscale graph that correctly represents the problem. 
The construction of a similarity multiscale graph (as exemplified in Fig. 2) allows us to encode local 
similarities at the same scale, and similarities at different scales. The edges of the similarity multiscale 
graph define the cliques present in (3). This knowledge can be used to improve the performance of the 
classification, as neighboring and similar partitions are likely to belong to the same class. 



[Figure 2: (a) multiscale graph and multiscale partitioning; (b) multiscale graph structure.] 

Fig. 2. Multiscale graph superimposed on the result of partitioning the same image at different scales (a) and on planes denoting the 
different scales (b). Nodes are denoted by circles, intrascale edges by gray lines, and interscale edges by black lines. 


A. Multiscale Superpixels 

We obtain a multiscale partitioning of the image by computing superpixels at different scales, that is, 
selecting increasing minimum superpixel sizes (MSS) for each superpixelization. This leads to multiple 
partitions on which the minimum number of pixels in each partition element is changed, corresponding 
to a scale of the partition. The scale selection must achieve a balance between spatial resolution and 
representative partition elements (with sufficient size to compute the statistics on the features). 

B. Design of the Similarity Multiscale Graph 

The design of the similarity multiscale graph is performed in three steps: (1) compute a graph for each 
single scale partition; (2) connect the single scale partition graphs; and (3) compute similarity-based edge 
weight assignment and prune edges. The main idea is that a partition will have an associated graph. By 
combining partitions with different scales (an inverse relation exists between the number of elements of 
a partition of an image and the scale associated with that partition), we are able to combine graphs with 
different scales. This will be the core of the similarity multiscale graph. 

1) Single Scale Graph as a Subgraph of the Multiscale Graph: Let us consider $\mathcal{P}_s(I)$, the 
set of partition elements $x_i^s$ obtained by partitioning the image $I$ at scale $s$. We associate a node $n_i^s$ to 
each partition element $x_i^s \in \mathcal{P}_s(I)$ and define the set of nodes at scale $s$ as 

$$\mathcal{V}_s = \bigcup_i \{n_i^s\}.$$

There is a one-to-one correspondence between partition elements $x_i^s$ and nodes $n_i^s$. For each pair of adjacent 
partition elements (partition elements that share at least one pixel at their boundary) at scale $s$, $(x_i^s, x_j^s)$, 
we create an undirected edge between the corresponding nodes. We have that the set of intrascale edges 
at scale $s$ is 

$$\mathcal{E}_s = \bigcup_i \bigcup_{n_j^s \in \mathcal{N}(n_i^s)} \{(n_i^s, n_j^s)\},$$

where $\mathcal{N}(n_i^s)$ is the set of neighbor nodes of $n_i^s$, that is, the set of nodes that correspond to the partitions 
adjacent to the partition $x_i^s$. Let $\mathcal{G}_s = (\mathcal{V}_s, \mathcal{E}_s)$ denote the graph associated to scale $s$. The union, for all 
scales, of the single scale graphs, that is, 

$$\bigcup_s \mathcal{G}_s = \Big( \bigcup_s \mathcal{V}_s, \; \bigcup_s \mathcal{E}_s \Big),$$

is itself a graph that represents the multiscale partitioning of the image, without edges existing between 
nodes at different scales. 

2) Multiscale Edge Creation: The multiscale graph is obtained by extending the union of all single-scale 
graphs $\bigcup_s \mathcal{G}_s$ to include interscale edges. For $s' > s$, let $\eta(n_i^s, s')$ be a function returning a node 
at scale $s'$ such that, for $j = \eta(n_i^s, s')$, we have $x_i^s \cap x_j^{s'} \neq \emptyset$; that is, $j = \eta(n_i^s, s')$ is a node at scale 
$s'$ whose corresponding partition element $x_j^{s'}$ has nonempty intersection with the partition element $x_i^s$. 
Based on this construction, a partition element cannot be related to two or more different larger scale 
partition elements but can be related to multiple lower level partition elements. Let $\mathcal{E}_{(s,s+1)}$ be the set of 
edges between nodes in $\mathcal{V}_s$ and $\mathcal{V}_{s+1}$; we have that 

$$\mathcal{E}_{(s,s+1)} = \bigcup_i \bigcup_{j = \eta(n_i^s, s+1)} \{(n_i^s, n_j^{s+1})\}.$$

The set $\mathcal{E}_{(s,s+1)}$ contains edges between adjacent scales, connecting the finer partition at a lower scale to 
the coarser partition at a higher scale. A node at scale $s$ has exactly one edge connecting to a node at 
scale $s + 1$ and at least one edge connecting to a node at scale $s - 1$. 
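
The two edge constructions above can be sketched on superpixel label maps as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: each scale is given as a 2-D integer array of superpixel indices, adjacency is taken as 4-connectivity, and η is realized by picking the coarser superpixel with the largest overlap (the text only requires a nonempty intersection). The function names are hypothetical.

```python
import numpy as np

def intrascale_edges(labels):
    """Undirected edges between superpixels that touch (4-connectivity)."""
    pairs = np.concatenate([
        np.stack([labels[:, :-1].ravel(), labels[:, 1:].ravel()], axis=1),
        np.stack([labels[:-1, :].ravel(), labels[1:, :].ravel()], axis=1),
    ])
    pairs = pairs[pairs[:, 0] != pairs[:, 1]]   # keep boundaries only
    pairs = np.sort(pairs, axis=1)              # (i, j) with i < j
    return {tuple(p) for p in pairs.tolist()}

def interscale_edges(fine, coarse):
    """One edge (fine_id, coarse_id) per fine superpixel, by largest overlap."""
    edges = set()
    for i in np.unique(fine):
        ids, counts = np.unique(coarse[fine == i], return_counts=True)
        edges.add((int(i), int(ids[np.argmax(counts)])))
    return edges

fine = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1],
                 [2, 2, 3, 3],
                 [2, 2, 3, 3]])
coarse = np.array([[0, 0, 0, 0],
                   [0, 0, 0, 0],
                   [1, 1, 1, 1],
                   [1, 1, 1, 1]])
```

In this toy example every fine superpixel is linked to exactly one coarse superpixel, while each coarse superpixel receives two interscale edges.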

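Considering a set of scales $S$, we have that the multiscale graph $\mathcal{G}$ resulting from the multiscale 
partitioning is 

$$\mathcal{G} = \Big( \bigcup_{s=1}^{|S|} \mathcal{V}_s, \; \underbrace{\bigcup_{s=1}^{|S|-1} \mathcal{E}_{(s,s+1)}}_{\text{interscale edges}} \cup \underbrace{\bigcup_{s=1}^{|S|} \mathcal{E}_s}_{\text{intrascale edges}} \Big) = (\mathcal{V}, \mathcal{E}).$$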

3) Edge Weight Assignment: Given the multiscale graph $\mathcal{G}$, we now compute and assign edge weights 
based on similarity. Let $f_{\mathrm{si}}(n_i^s)$ be a function that computes similarity features on the node $n_i^s$, 
corresponding to the partition element $x_i^s$. The weight of the edge $(n_i^s, n_j^{s'}) \in \mathcal{E}$ is computed as 

$$w_{n_i^s, n_j^{s'}} \propto \nu(s, s') \exp\big( -\| f_{\mathrm{si}}(n_i^s) - f_{\mathrm{si}}(n_j^{s'}) \|^2 / \gamma \big), \tag{8}$$

where $\gamma$ is a scale parameter, $\exp(-\| f_{\mathrm{si}}(n_i^s) - f_{\mathrm{si}}(n_j^{s'}) \|^2 / \gamma)$ quantifies the similarity between the two nodes 
$n_i^s$ and $n_j^{s'}$, and $\nu(s, s') = w_{\mathrm{intrascale}}$ if $s = s'$, and $\nu(s, s') = w_{\mathrm{interscale}}$ if $s \neq s'$. The rationale for different weights 
for intrascale and interscale edges comes from the different effect of the multiscale structure. For a given 
value of intrascale weight, lower values of the interscale edge weight downplay the multiscale effect on 
the graph, and higher values of the interscale edge weight accentuate the multiscale effect. 
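
The weight assignment in (8) can be sketched as below. The proportionality constant is taken to be 1, which is an assumption of this example, and the function name `edge_weight` and its default multipliers are hypothetical.

```python
import numpy as np

def edge_weight(feat_a, feat_b, same_scale, gamma, w_intra=1.0, w_inter=1.0):
    """Similarity-based weight of one edge, following the form of (8).

    feat_a, feat_b: similarity feature vectors of the two nodes.
    same_scale: True for an intrascale edge, False for an interscale edge.
    gamma: scale parameter of the exponential.
    """
    multiplier = w_intra if same_scale else w_inter
    sq_dist = float(np.sum((np.asarray(feat_a) - np.asarray(feat_b)) ** 2))
    return multiplier * np.exp(-sq_dist / gamma)
```

Identical features give the largest weight for a given multiplier, and lowering `w_inter` relative to `w_intra` reduces the influence of the multiscale structure, as discussed above.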

IV. Expert Classification 


The expert classification block of the system is constructed from two sequential steps: feature extraction 
and classification. The feature extraction step consists in computing the application-specific features and 
extracting statistics of the features on each of the lowest level partitions. In the classification step, the 
classifier is trained and applied to the data, and the classification risk is computed. As the feature extraction 
procedure was introduced in Section II-B and is application-dependent, and the classification procedure 
was described in Section II-C, we will focus on the computation of the classification risk. 


A. Rejection by Risk Minimization 

By approaching classification as a risk minimization problem, we are able to introduce rejection. To 
improve accuracy at the expense of not classifying all partitions, we classify while rejecting. Let $\mathcal{L}' = 
\mathcal{L} \cup \{N + 1\}$ be an extended set of partition class labels with an extra label. The rejection class can 
be considered as an unknown class that represents the inability of the classifier to correctly classify all 
samples. The extra label $N + 1$ corresponds to this rejection class. 

1) Classification with Rejection by Risk Minimization: Given a feature vector $\mathbf{f}_i$, associated to a partition 
element $x_i$, and the respective (unobserved) label $y_i \in \mathcal{L}$, the objective of the proposed classification with 
rejection is to estimate $y_i$ if the estimation is reliable, and to do nothing (rejection) otherwise. 

To formalize the classification with rejection, we introduce the random variable $\hat{y}_i \in \mathcal{L}'$, for $i \in \mathcal{V}$, 
where $\hat{y}_i = N + 1$ denotes rejection. In addition, let us define an $(N + 1) \times N$ cost matrix $\mathbf{C} = [c_{j_1, j_2}]$, 
where the element $c_{j_1, j_2}$ denotes the cost of deciding that $\hat{y}_i = j_1$ when we have $y_i = j_2$, and does not 
depend on $i \in \mathcal{V}$. 

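Let the classification risk of $\hat{y}_i = k$ conditioned to $\mathbf{f}_i$ be defined as 

$$R(\hat{y}_i = k \,|\, \mathbf{f}_i) = \mathbb{E}_{y_i}\big[ c(\hat{y}_i = k, y_i) \,|\, \mathbf{f}_i \big] = \sum_{j_2=1}^{N} c_{k, j_2} \, p(y_i = j_2 \,|\, \mathbf{f}_i, \mathbf{W}).$$

By setting $c_{N+1, j_2} = \rho$, we get 

$$\begin{aligned} R(\hat{y}_i = k, k \neq N+1 \,|\, \mathbf{f}_i) &= \sum_{j_2=1}^{N} c_{k, j_2} \, p(y_i = j_2 \,|\, \mathbf{f}_i, \mathbf{W}), \\ R(\hat{y}_i = k, k = N+1 \,|\, \mathbf{f}_i) &= \rho. \end{aligned} \tag{9}$$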

By minimizing (9) over all possible partition labelings, we obtain 

$$\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \sum_{i \in \mathcal{V}} R(y_i \,|\, \mathbf{f}_i). \tag{10}$$

Note that if $c_{j_1, j_2} = 1 - \delta_{j_1 - j_2}$, where $\delta_n$ is the Kronecker delta function, minimizing (10) yields 

$$\hat{y}_i = \begin{cases} \arg\max_{y_i \in \mathcal{L}} p(y_i \,|\, \mathbf{f}_i, \mathbf{W}), & \max_{y_i \in \mathcal{L}} p(y_i \,|\, \mathbf{f}_i, \mathbf{W}) > 1 - \rho; \\ N + 1, & \text{otherwise.} \end{cases}$$

In other words, if the maximum element of the estimate of the probability vector is large, we are reasonably 
sure of our decision and assign the label as the index of the element; otherwise, we are uncertain and 
thus assign the unknown-class label. 
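
The thresholding rule above can be sketched in a few lines. In this minimal example, which is an illustration rather than the implementation used in this work, labels are 0-based, so the classes are 0, ..., N-1 and the rejection label is N; the name `classify_with_rejection` is hypothetical.

```python
import numpy as np

def classify_with_rejection(P, rho):
    """Apply the 0-1 cost rejection rule to a batch of posteriors.

    P: (n, N) array of class posteriors, one row per sample.
    rho: rejection threshold (the cost assigned to rejecting).
    Returns labels in {0, ..., N-1}, or N where the sample is rejected.
    """
    P = np.asarray(P)
    best = P.argmax(axis=1)
    risk = 1.0 - P.max(axis=1)   # risk of choosing the most probable class
    return np.where(risk < rho, best, P.shape[1])

labels = classify_with_rejection([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]], rho=0.3)
print(labels)  # [0 2 1]
```

The second sample is rejected because its largest posterior, 0.55, does not exceed 1 - ρ = 0.7.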

2) Including Expert Knowledge: Expert knowledge can be included in the risk minimization. Class 
labels can be grouped in $L$ superclasses $\mathcal{L}_1, \dots, \mathcal{L}_L$ (each superclass is an element of a partition 
of the set of classes $\mathcal{L}$), such that misclassification within the same superclass has a cost different 
from misclassification between different superclasses. 

Let us now consider the following cost elements, with a cost $g$ for misclassification within the same 
superclass: 

$$c'_{j_1, j_2} = \begin{cases} 0 & \text{if } j_1 = j_2; \\ g & \text{if } j_1 \text{ and } j_2 \text{ belong to the same superclass;} \\ 1 & \text{otherwise.} \end{cases}$$

The expected risk, considering expert knowledge, of selecting the class label $\hat{y}_i \in \mathcal{L}'$ in the partition is 

$$\begin{aligned} R'(\hat{y}_i = k, k \neq N+1 \,|\, \mathbf{f}_i) &= \sum_{j_2=1}^{N} c'_{k, j_2} \, p(y_i = j_2 \,|\, \mathbf{f}_i, \mathbf{W}), \\ R'(\hat{y}_i = k, k = N+1 \,|\, \mathbf{f}_i) &= \rho. \end{aligned} \tag{11}$$

Minimizing (11) over all possible partition labelings yields 

$$\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \sum_{i \in \mathcal{V}} R'(y_i \,|\, \mathbf{f}_i, \mathbf{W}).$$

This formulation allows us to include expert knowledge in the assessment of the risk of assigning a label. 
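
A minimal sketch of the risk in (11) follows. It builds the superclass-aware cost matrix for the N regular classes and handles the rejection row separately through the threshold ρ; labels are 0-based, with N denoting rejection. The names `superclass_cost` and `min_risk_labels` are hypothetical, and ties at the threshold are resolved as rejection, which is a choice made for this example.

```python
import numpy as np

def superclass_cost(superclass_of, g):
    """Cost matrix c': 0 on the diagonal, g within a superclass, 1 otherwise.

    superclass_of[l] is the superclass index of class l.
    """
    s = np.asarray(superclass_of)
    C = np.where(s[:, None] == s[None, :], float(g), 1.0)
    np.fill_diagonal(C, 0.0)
    return C

def min_risk_labels(P, C, rho):
    """Pick the minimum-risk class, or reject when that risk is not below rho."""
    risk = np.asarray(P) @ C.T          # risk[i, k] = sum_j C[k, j] * P[i, j]
    best = risk.argmin(axis=1)
    return np.where(risk.min(axis=1) < rho, best, C.shape[0])

# Classes 0 and 1 share a superclass; class 2 is on its own.
P = np.array([[0.5, 0.45, 0.05]])
C_expert = superclass_cost([0, 0, 1], g=0.5)
C_plain = 1.0 - np.eye(3)
print(min_risk_labels(P, C_expert, rho=0.3))  # [0]
print(min_risk_labels(P, C_plain, rho=0.3))   # [3]
```

With the plain 0-1 cost the sample is rejected, since its risk is 0.5. With the reduced within-superclass cost, confusing classes 0 and 1 is penalized less, the risk drops to 0.275, and the sample is classified.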

V. Contextual Rejection 

A. Problem Formulation 

We formulate the problem of classification with rejection using contextual information as a risk
minimization problem defined over the similarity multiscale graph $\mathcal{G}$.

As shown in [?], we can pose the classification problem as an energy minimization problem of two
potentials over the undirected graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ representing the multiscale partitioning of the image $I$.
The association potential $D$ is the data term; the interaction potential $V_{i,j}$, for $(i,j) \in \mathcal{E}$, is the contextual
term; and $\alpha \in [0,1]$, denoted the contextual index, is a weight factor that balances the relative weight of
the two. Then,

$$\hat{\mathbf{y}} = \arg\min_{\mathbf{y}} \; (1 - \alpha) \sum_{i \in \mathcal{V}} D(y_i, \mathbf{f}_i) + \alpha \sum_{(i,j) \in \mathcal{E}} V_{i,j}(y_i, y_j). \tag{12}$$
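
To make the objective concrete, the sketch below evaluates the weighted sum of unary and pairwise terms for a given labeling and, for toy graphs only, finds the minimizer by exhaustive search. The function names, the data layout (`unary[i][label]`, an edge list, a `pairwise` callable), and the brute-force search are illustrative assumptions; the source does not specify the optimizer in this section, and exhaustive search is feasible only for a handful of nodes.

```python
from itertools import product

def labeling_energy(labels, unary, edges, pairwise, alpha):
    """(1 - alpha) * sum of unary terms + alpha * sum of pairwise terms.

    labels:   one label per node.
    unary:    unary[i][label] is the association potential of node i.
    edges:    iterable of (i, j) node pairs.
    pairwise: callable (i, j, y_i, y_j) -> interaction potential.
    alpha:    contextual index in [0, 1].
    """
    data_term = sum(unary[i][y] for i, y in enumerate(labels))
    context_term = sum(pairwise(i, j, labels[i], labels[j]) for i, j in edges)
    return (1.0 - alpha) * data_term + alpha * context_term

def brute_force_labeling(unary, edges, pairwise, alpha):
    """Exhaustive minimizer, usable only on very small graphs."""
    n_nodes, n_labels = len(unary), len(unary[0])
    return min(
        product(range(n_labels), repeat=n_nodes),
        key=lambda y: labeling_energy(y, unary, edges, pairwise, alpha),
    )
```

On a three-node chain where the middle node weakly prefers a different label than its neighbors, `alpha = 0` keeps the isolated preference, while `alpha = 0.5` with a 0/1 pairwise term smooths it away.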


B. Association Potential: Expert Knowledge 

The association potential measures the disagreement between the labeling and the data; we formulate
it as a strictly increasing function of the classification risk in (11):

$$D(y_i, \mathbf{f}_i) = \log\left(R'(y_i \mid \mathbf{f}_i, \mathbf{W})\right), \quad \text{for } i \in \mathcal{V}.$$

This unary association potential is associated with the nodes $\mathcal{V}$ of the graph (partitions), and includes the
rejection that is present in the classification risk $R'$.

C. Interaction Potential: Similarity

The interaction potential is based on the topology of the graph $\mathcal{G}$, combining intralevel and interlevel
interactions between the pairs of nodes connected by edges, based on their similarity. We define an
interaction function $\psi$ that enforces piecewise smooth labeling among the pairs of nodes connected by
edges.

In the design of the similarity multiscale graph, the difference between intralevel and interlevel edges is
encoded in different multiplier constants of the edge weight. This allows us to work with intralevel and
interlevel edges in the same way, without increasing the complexity of the pairwise potential. Accordingly,
we set the interaction potential to

$$V_{i,j}(y_i, y_j) = w_{ij}\, \psi(y_i, y_j), \tag{13}$$

where $w_{ij}$, for $(i,j) \in \mathcal{E}$, corresponds to the edge weight of the similarity multiscale graph.

1) Interaction function: The interaction function $\psi$ enforces piecewise smoothness in neighboring
partitions; its general form is $\psi(y_i, y_j) = 1 - \delta_{y_i - y_j}$, that is, 0 if $y_i = y_j$ and 1 otherwise.

It is desirable, however, both to ease the transition into and out of the rejection class and to ease the
transitions between classes belonging to the same superclass. We achieve this by adding a superclass
consistency parameter $\psi_C$ and a rejection consistency parameter $\psi_R$ to the interaction function as follows:

$$\psi(y_i, y_j) = \begin{cases} 0, & \text{if } y_i = y_j; \\ \psi_C, & \text{if } y_i \text{ and } y_j \text{ belong to the same superclass;} \\ \psi_R, & \text{if } y_i = N + 1 \text{ or } y_j = N + 1; \\ 1, & \text{otherwise.} \end{cases}$$

Defining a rejection consistency parameter $\psi_R$ allows us to have an interaction function that can be
metric, meaning that the interaction potential will be metric. Another effect is the ability to control
the structure of the rejected area. With a rejection consistency parameter close to 0, we obtain a structured
labeling with unstructured rejection; this means that rejection areas can be spread over the image and
can consist of one partition element only. With a higher value, we impose structure both on the
labeling and on the rejection areas, leading to larger and more compact rejection areas.
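
The interaction function with superclass and rejection consistency translates directly into code. In this sketch the names `interaction` and `pairwise_potential`, the dictionary `superclass_of`, and the explicit `reject_label` argument are assumptions for illustration. It also assumes that the rejection label belongs to no superclass, so the rejection case is tested before the superclass lookup, and that the pairwise potential is the edge weight times the interaction function, as the text describes.

```python
def interaction(y_i, y_j, superclass_of, reject_label, psi_c, psi_r):
    """Interaction function with superclass and rejection consistency.

    superclass_of: dict mapping each class label to its superclass.
    reject_label:  label used for rejection (N + 1 in the text).
    """
    if y_i == y_j:
        return 0.0
    if y_i == reject_label or y_j == reject_label:
        return psi_r
    if superclass_of[y_i] == superclass_of[y_j]:
        return psi_c
    return 1.0

def pairwise_potential(w_ij, y_i, y_j, superclass_of, reject_label, psi_c, psi_r):
    """Edge weight times the interaction function (assumed form)."""
    return w_ij * interaction(y_i, y_j, superclass_of, reject_label, psi_c, psi_r)
```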


VI. Experimental Results 

With the framework for image classification with rejection using contextual information in place, we
now show examples of its application to real data. The main application area of the framework is
tied to the subclass of image classification problems described in the introduction: ill-posed classification
problems where access to representative pixelwise ground truth is prohibitive, the pixels can belong
to uninteresting or unknown classes, and the need for high accuracy surpasses the need to classify all
samples.


The first example, the classification of natural images (Section VI-A), illustrates the generality of the
framework. Although designed for a subclass of image classification problems, the proposed framework
can also be applied to more general image classification problems: supervised segmentation of natural
images. The second example, the classification of H&E stained teratoma tissue images (Section VI-B),
shows the advantages of using a robust classification scheme combining rejection and context on the main
application area of this framework.

With the classification of natural images, we also explore the effect of the graph structure on the 
classification of an image: how the classification with rejection propagates through the different layers of 
the multiscale graph; and how the number of scales, or “depth” of the multiscale graph, affects the 
performance of the classification. With the classification of H&E images, we also explore the joint 
interaction between context and rejection in the classification problem, and the behavior of the framework 
as the difficulty of the classification problem increases.

As the concept of combining classification with context and classification with rejection in pixelwise
image classification is novel, there are no competing methods or frameworks to compare against. To
assess the performance of the framework, we compare it with the performance of context only, and with
the performance of rejection only with selection of optimal rejected fractions.


A. Natural Images 

We illustrate the flexibility of the formulation by applying it to the classification of
natural images (Fig. 3). We obtain a multilevel classification from an image extracted from the BSD500
data set [?].

1) Experimental setup: Both the application-specific features (for classification) and the similarity
features (for graph construction) are the color in the RGB color space, and the statistic extracted from the
partition elements is the sample mean of the RGB color inside the partition element. This means
that, for the $i$th partition element $x_i$, both the application-specific and the similarity features consist of the
sample mean of $\{\mathbf{z}_j, j \in x_i\}$, the RGB colors inside the partition element. The number of classes
is $N = 3$, and 10 randomly selected superpixels from the lowest scale are used to train the classifier.
No superclass structure is assumed.
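
The per-partition sample mean described above can be computed without explicit loops over pixels. This is a minimal sketch under stated assumptions: `partition` is an integer map assigning each pixel to a partition element with contiguous ids `0..P-1`, every id occurs at least once, and the function name `partition_mean_features` is hypothetical.

```python
import numpy as np

def partition_mean_features(image, partition):
    """Mean RGB value of the pixels inside each partition element.

    image:     (H, W, 3) array of RGB values.
    partition: (H, W) integer array of partition ids in 0..P-1.
    Returns a (P, 3) array with one mean color per partition element.
    """
    image = np.asarray(image, dtype=float)
    ids = np.asarray(partition).ravel()
    n_parts = int(ids.max()) + 1
    counts = np.bincount(ids, minlength=n_parts)
    # Sum each color channel per partition element, then divide by pixel counts.
    sums = np.stack(
        [np.bincount(ids, weights=image[..., c].ravel(), minlength=n_parts)
         for c in range(3)],
        axis=1,
    )
    return sums / counts[:, None]
```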

2) Effect of the multiscale graph on the classification: The effect of the multiscale graph on the
classification is illustrated in Fig. 3: finer segmentations on the smaller scales, with disjoint rejected
areas, and coarser segmentations on the larger scales, with a large rejection area. Due to the characteristics
of the superpixelization, the class boundaries appear natural in all scales.



Fig. 3. Example of classification with rejection (in black) across multiple levels in a natural image from the BSD500 data set. 


We illustrate the robustness of the framework with regard to the number of scales by comparing the
classification performance with a varying number of scales (Fig. 4). The variation of the number of scales
is achieved by stacking coarser single-scale graphs on the multiscale graph, through an increase of the
minimum superpixel size (MSS) by a factor of 2: 1 scale corresponds to a single-scale graph of MSS
100, 2 scales to a multiscale graph of MSS of (100, 200), up to 11 scales, which corresponds to a multiscale
graph of MSS of (100, 200,..., 6400).

Fig. 4 clearly shows the performance improvement of multiscale similarity graphs (more than one
scale) over single-scale similarity graphs (just one scale). The stabilization of the mean performance
for more than 4 scales is an indicator of the robustness of the framework with regard to the number of
scales.

Fig. 4. Evolution of classification performance with number of scales (horizontal axis: number of scales, 1 to 11). Results obtained
from 30 Monte Carlo runs with different training sets of 10 randomly selected samples per class of the image in Fig. 3. The variation
of performance for more than 4 scales is negligible.

B. H&E Data Set

Our H&E data set consists of 36 images of size 1600 x 1200 of H&E stained teratoma tissue slides, imaged
at 40X magnification and containing 20 classes; Fig. 8 shows three examples.

1) Experimental Setup: As application-specific features we use the histopathology vocabulary (HV) [?].
These features emulate the visual cues used by expert histopathologists [?], and are thus
physiologically relevant. From the HV, we use nucleus size (1D), nucleus eccentricity (1D), nucleus density
(1D), nucleus color (3D), red blood cell coverage (1D), and background color (3D). As similarity features
we use the color in the RGB color space.

The statistic extracted for the application-specific and the similarity features, on the lowest level of the
partition, consists of the sample mean of the feature values on the partition. This choice balances good
classification performance, low feature dimensionality, and low complexity. It results in 10-dimensional
application-specific feature vectors and 3-dimensional similarity feature vectors. The superclasses are
constructed from the germ layer (endoderm, mesoderm, and ectoderm). Classes derived from the same
germ layer belong to the same superclass.

The multiscale similarity graph is built with six scales with a MSS of (100,200,400,800,1600,3200) 
for each of the layers of the similarity graph. This provides a compromise between the computational 
burden associated with large similarity graphs and the performance increase obtained. The results we 
present with six scales are marginally better than the ones achieved with five or seven scales. 

2) Parameter Analysis: In this section we analyze the impact of the regularization parameter $\lambda$ of the
LORSAL algorithm, the contextual index $\alpha$, and the rejection threshold $\rho$. The regularization parameter
$\lambda$ controls the generalization capability of the classifier. The contextual index $\alpha$ weights the contextual
information; $\alpha = 0$ means no contextual information is used and $\alpha = 1$ means no classification information is
taken into account. The rejection threshold $\rho$ denotes our confidence in the classification result; lower values
of $\rho$ denote low confidence in classification and higher values of $\rho$ denote high confidence in classification.

To evaluate the parameters, we define two types of training sets, based on the origin of the training
samples: (1) A single-image training set $S_k$, composed of $k$ samples extracted from a test image. This
training set is used to train the classifier, which is then applied to the entire image. (2) A training set $S_{k,k}$
containing $k$ training samples from each image of a given set. This training set is used to evaluate the
classifier in situations in which we have no knowledge about the tissues. Note that each of the 36 H&E
images not only contains a different set of tissues, but was also potentially stained and acquired using
different experimental protocols, with no guarantee of normalization of the staining process.

The remaining parameters are set empirically according to the experts. The interscale ($u_{\text{interscale}}$) and
intrascale ($u_{\text{intrascale}}$) weights for the similarity graph construction are set to 4 and 1, respectively, to achieve
a "vertical" consistency in the multiscale classification. Larger values of the interscale weight relative to
the intrascale weight enforce a stronger multiscale effect on the segmentation: the different layers of the graph
will be more similar to each other.

The superclass misclassification cost $g$ is set to 0.7; the superclass consistency $\psi_C$ and rejection
consistency $\psi_R$ are set to 0.7 and 0.5, respectively, to ease transitions into same-superclass tissues and into rejection,
and to maintain a metric interaction potential. Larger values of the superclass consistency $\psi_C$ lead to
smaller borders (in length) between elements of the same superclass, and smaller values lead to larger
borders. The value of the rejection consistency $\psi_R$ affects the length of the border of the rejected areas
(their perimeter): smaller values of $\psi_R$ lead to disconnected rejected areas (with a large perimeter), usually
thin rejection zones between two different classes, whereas larger values of $\psi_R$ lead to connected rejected
areas (with a small perimeter), usually rejection blobs that reject an entire area. To achieve similar levels
of rejected fraction, the rejection threshold $\rho$ must accommodate the value of $\psi_R$, as larger values of $\psi_R$
mean more costly rejection areas.
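
A short worked check shows why a rejection consistency of 0.5 keeps the interaction function metric. It assumes, as an interpretation not spelled out in the text, that the rejection label belongs to no superclass; the labels $a$, $b$, $c$ are introduced here only for the argument. For distinct labels, the triangle inequality $\psi(a,b) \le \psi(a,c) + \psi(c,b)$ is tightest when the intermediate label $c$ is the rejection label, since any other $c$ contributes at least $\psi(a,b)$ on the right-hand side. With the rejection label as intermediate, the requirement becomes

$$\psi(a,b) \le 2\psi_R: \qquad \underbrace{1 \le 2 \cdot 0.5 = 1}_{a,\, b \text{ in different superclasses}}, \qquad \underbrace{0.7 \le 2 \cdot 0.5 = 1}_{a,\, b \text{ in the same superclass}}.$$

Both inequalities hold, the first with equality, so under these assumptions 0.5 is the smallest rejection consistency that preserves the metric property when the cost between different superclasses is 1.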

a) LORSAL Parameter Analysis: By varying the value of $\lambda$, we obtain different regressors
$\mathbf{W}_\lambda$ (one matrix of parameters per value of $\lambda$). We expect that by increasing the value of $\lambda$ up to a
point, a regressor with greater generalization capability can be obtained, and thus increased classification
performance. However, increasing $\lambda$ further will lead to lower performance, as the sparsity term in
the optimization will overwhelm the data fit term. On the other hand, lower values of $\lambda$ will lead to an
overfitted regressor, which will cause loss of performance.

To evaluate the generalization capability of the classifier, we test it with an entire data set training set
$S_{75,75}$. With the entire data set training set created, each image is classified by the following maximum a
posteriori classifier for each of the regressors obtained for different values of $\lambda$:

$$\hat{y}_i = \arg\max_{k \in \mathcal{L}} p(y_i = k \mid \mathbf{f}_i, \mathbf{W}_\lambda). \tag{14}$$

The overall accuracy is computed for each image, as well as the sparsity of the regressor $\mathbf{W}_\lambda$.
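
The maximum a posteriori step can be sketched as follows. The softmax parameterization below is the generic multinomial logistic regression form and is an assumption: the exact feature mapping and parameterization used by the authors are not given in this section, and the function names are hypothetical.

```python
import numpy as np

def mlr_posteriors(features, weights):
    """Softmax posteriors of a multinomial logistic regression (assumed form).

    features: (n_samples, d) feature vectors.
    weights:  (d, N) regressor matrix, one column per class.
    """
    scores = np.asarray(features, dtype=float) @ np.asarray(weights, dtype=float)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    expo = np.exp(scores)
    return expo / expo.sum(axis=1, keepdims=True)

def map_labels(features, weights):
    """Label with the largest posterior for each sample."""
    return np.argmax(mlr_posteriors(features, weights), axis=1)

def overall_accuracy(predicted, truth):
    """Fraction of samples whose predicted label matches the ground truth."""
    return float(np.mean(np.asarray(predicted) == np.asarray(truth)))
```

Repeating this for each regressor obtained with a different regularization value and averaging the accuracy over images gives the kind of curve the parameter analysis relies on.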


Fig. 5. LORSAL parameter analysis (panels: accuracy; relative regressor sparsity). Effect of $\lambda$ on the overall accuracy values and
sparsity of $\mathbf{W}$. Mean accuracy (in black), standard deviation (in gray), overlapped with the results for all images. Note the three
zones of accuracy behavior: no effect, increase, decrease. The maximum overall accuracy (66.4%) is obtained for $\lambda = 10$ with a value
of relative regressor sparsity of 0.352.


From Figure 5 it is clear that there exist three different zones of accuracy behavior with the increasing
sparsity of the regressor:

• For $0 < \lambda < 1$ there is no effect: the data term vastly outweighs the regularization term;

• For $1 < \lambda < 10$ there is an increase in classification performance: increasing the regularization
term improves the generalization capability of the classifier;

• For $\lambda > 10$ there is a decrease in classification performance: increasing the regularization term
hampers the capability of the classifier.

We empirically choose $\lambda$ to be 10, as it maximizes the overall accuracy of the classifier.

b) Effect of contextual index and rejection threshold on the classification performance: The inclusion
of rejection in the classification leads to problems in the measurement of the performance of the classifier.
As the accuracy is measured only on the nonrejected samples, it is not a good index of performance
(the behavior of the classifier can be skewed to a very large reject fraction that will lead to nonrejected
accuracies close to 1). To cope with this, we use the quality of classification $Q$ [15]. The intuition is
that, by maximizing $Q$, we maximize both the number of correctly classified samples not rejected and
the number of incorrectly classified samples rejected. By varying the value of the contextual index $\alpha$
in (12), we weight differently the role of contextual information in the classification. For $\alpha = 0$,
no contextual information is used, equivalent to (14), whereas for $\alpha = 1$, the problem degenerates into
assigning a single class to the entire image. By varying the value of the rejection threshold $\rho$, we
assign different levels of confidence to the classifier, i.e., $\rho = 0$ is equivalent to no confidence in
the classifier (reject everything), whereas $\rho = 1$ is equivalent to total confidence in the classifier (reject
nothing).
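
The performance measures discussed above can be computed from three arrays. This sketch follows the intuition stated in the text (count correct samples that are kept plus incorrect samples that are rejected); the precise definition is the one in [15], and the function name and the convention that `predicted` holds the label the classifier would assign if rejection were ignored are assumptions.

```python
import numpy as np

def rejection_metrics(predicted, truth, rejected):
    """Classification quality, nonrejected accuracy, and rejected fraction.

    predicted: label the classifier would assign, ignoring rejection.
    truth:     ground-truth labels.
    rejected:  boolean mask, True where the sample is rejected.
    """
    predicted, truth = np.asarray(predicted), np.asarray(truth)
    rejected = np.asarray(rejected, dtype=bool)
    correct = predicted == truth
    kept = ~rejected
    n = truth.size
    quality = (np.sum(correct & kept) + np.sum(~correct & rejected)) / n
    kept_count = np.sum(kept)
    nonrejected_accuracy = (
        np.sum(correct & kept) / kept_count if kept_count else float("nan")
    )
    return {
        "quality": float(quality),
        "nonrejected_accuracy": float(nonrejected_accuracy),
        "rejected_fraction": float(np.mean(rejected)),
    }
```

Rejecting everything drives the rejected fraction to 1 and makes the nonrejected accuracy undefined, while the quality falls to the fraction of misclassified samples, which is why the quality is the more informative index here.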

As the contextual index $\alpha$ and the rejection threshold $\rho$ interact jointly, we now analyze the classification
quality $Q$ for different situations.










(c) $Q$ for $S_{240}$ (max. $Q$ 0.88). (d) $Q$ for $S_{60,60}$ (max. $Q$ 0.80). Horizontal axes: rejection threshold $\rho$.

Fig. 6. Variation of quality of classification $Q$ with the contextual index $\alpha$ and the rejection threshold $\rho$ for four different training sets.
Adjacent contour lines correspond to a 0.01 variation of $Q$. A shift toward lower dependency on rejection and contextual information
is clear as the size of the training set, and consequently the classifier performance, increases.

Fig. 7. Variation of nonrejected accuracy with the contextual index $\alpha$ and rejection threshold $\rho$. The dark line corresponds to the level set
of quality of classification $Q$ equal to 99% of its maximum value. The maximum nonrejected accuracy is 85%, corresponding to $\rho = 0.46$
and $\alpha = 0.58$. The corresponding rejection fraction $r$ is 4.6%.


We test with three single-image training sets $S_{60}$, $S_{120}$, $S_{240}$, corresponding roughly to using 1.5%, 3%,
and 6% of the samples of the image. We also test with an entire data set training set $S_{60,60}$, in which only
3% of the data set is composed of samples from the test image. For each type of training set, we use as
test images each of the 36 images of the data set, presenting the mean value of $Q$.

From Figure 6 we can observe the variation of the performance of the classifier with $\alpha$ and $\rho$ for
different situations. The change from (a) to (c) corresponds to an increase in the size of the training
set. Both the improvement of the maximum value and the shift to lower values of the contextual index and
higher values of the rejection threshold can be explained by the increasing performance of the classification. This
means that a more reliable classification is available, decreasing the need to use contextual information
and rejection. On the other hand, (d) corresponds to an extreme situation in which the training set is
highly noisy, with only 3% of samples belonging to the test image. The high dependency on contextual
information in this case is clear. The maximum value of $Q$ is attained at lower values of the rejection
threshold and higher values of the contextual index.

3) Parameter Selection: As seen in Figure 6, the quality of classification varies with the type of
application; applications for which the training set is easier will lead to lower reliance on contextual
information and rejection, and harder training sets will lead to the opposite. In order to select a single
set of parameters, we combine the results of the four different training sets for each of the 36 images,
obtaining the average of the classification quality $Q$ and nonrejected accuracy for the resulting 4 x 36
instances. Our motivation for the selection of the parameters is to maximize the accuracy of the nonrejected
fractions within a zone of high classification quality. To do so, we select the region of high values of $Q$
($Q$ higher than 99% of its maximum value). Then we select the parameters that maximize the nonrejected
accuracy, as seen in Figure 7.
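
The two-step selection just described can be sketched on a parameter grid. The sketch assumes that the averaged quality and nonrejected accuracy have already been tabulated on a grid indexed by contextual index (rows) and rejection threshold (columns), and that the maximum quality is positive; the function name and grid layout are hypothetical.

```python
import numpy as np

def select_parameters(quality, accuracy, alphas, rhos, fraction=0.99):
    """Maximize nonrejected accuracy inside the high-quality region.

    quality, accuracy: (len(alphas), len(rhos)) grids of averaged values.
    fraction:          keep grid points with quality >= fraction * max quality.
    Returns the selected (alpha, rho) pair.
    """
    quality = np.asarray(quality, dtype=float)
    accuracy = np.asarray(accuracy, dtype=float)
    high_quality = quality >= fraction * quality.max()
    # Exclude low-quality points from the search by masking them with -inf.
    masked = np.where(high_quality, accuracy, -np.inf)
    a_idx, r_idx = np.unravel_index(np.argmax(masked), masked.shape)
    return alphas[a_idx], rhos[r_idx]
```

Restricting the search to the high-quality region prevents the degenerate choice of rejecting almost everything, which would inflate the nonrejected accuracy.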

4) Results: We present results of our method on a set of 3 images from the data set containing a
different number of classes (as seen in Figure 8). The classifications are obtained with different training
sets to illustrate different challenges. In image 1, to create a small and nonrepresentative training set,
the training set is composed of 5 randomly chosen partition elements per class (roughly 0.6% of total).
In image 2, to create a representative training set, the training set is composed of 120 randomly chosen
partition elements from the entire image (roughly 3% of total). In image 3, to create a small representative
training set with high class overlap, the training set is composed of 20 randomly chosen partition elements
from the entire image (roughly 0.5% of total). In all cases, the $\lambda$ parameter is set to 5, with the rest of
the parameters unchanged.

TABLE I

Classification and rejection performance metrics for the example images in Figure 8. Classification with
rejection and context (white background), classification with context without rejection (green background),
classification without context with rejection (red background), and classification without context and without
rejection (brown background).

Image   Nonrejected   Rejected   Rejection   Classification   Accuracy with
        accuracy      fraction   quality     quality          no rejection

Classification with rejection and context
1       0.701         0.347      3.37        0.662            0.600
2       0.891         0.067      10.11       0.868            0.862
3       0.967         0.140      9.97        0.866            0.937

Classification with rejection without context
1       0.702         0.370      3.90        0.673            0.582
2       0.878         0.031      9.69        0.868            0.863
3       0.936         0.000      3920        0.936            0.935

TABLE II

Class-specific results for the example images in Figure 8.

Tissue type        Train     Test      Rejected   Rejection   Nonrejected   Classification   Accuracy
                   samples   samples   samples    quality     accuracy      quality          no rejection

Image 1
Other              5         410       134        0.77        0.72          0.55             0.75
Fat                5         54        1          0.00        0.94          0.93             0.93
Gastrointestinal   5         1036      170        3.90        0.91          0.83             0.86
Smooth muscle      5         1283      529        1.81        0.69          0.64             0.58
Mesenchyme         5         454       174        4.04        0.53          0.66             0.38
Mat. neuroglial    5         369       143        1.82        0.35          0.53             0.29

Image 2
Other              30        885       24         5.80        0.91          0.90             0.90
Fat                13        510       48         4.54        0.77          0.75             0.74
Skin               36        1157      37         20.35       0.98          0.96             0.97
Mesenchyme         41        1268      127        6.17        0.86          0.83             0.82

Image 3
Bone               2         725       246        1.60        0.75          0.64             0.69
Mesenchyme         18        3195      319        11.27       1.00          0.91             0.99

We analyze both overall results (in Table I) and class-specific results (in Table II). The computation of
the rejection quality is based on the results of classification with contextual information and no rejection
(i.e., comparing the labeling with rejection to the labeling resulting from setting the rejection threshold $\rho$ to
1 in (12)).

In Table I, we compare the performance of classification with contextual information and rejection with
that of context only (obtained by setting $\rho = 1$) and with that of classification with rejection only with optimal
rejected fraction (obtained by sorting the partition elements according to the maximum a posteriori probability
and selecting the rejected fraction that maximizes the classification quality).
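
The rejection-only baseline with optimal rejected fraction can be sketched as a single sweep over the confidence-sorted samples. Note that it uses the ground truth to pick the fraction, so it is an oracle reference rather than a deployable rule. The function name is hypothetical, and the quality is computed under the assumption that it counts correct kept samples plus incorrect rejected samples.

```python
import numpy as np

def optimal_rejected_fraction(probs, truth):
    """Reject the least confident samples so that classification quality is maximal.

    probs: (n_samples, N) posterior estimates.
    truth: ground-truth labels.
    Returns (rejected fraction, quality, boolean rejection mask).
    """
    probs, truth = np.asarray(probs, dtype=float), np.asarray(truth)
    predicted = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    order = np.argsort(confidence, kind="stable")  # least confident first
    wrong_sorted = (predicted != truth)[order]
    n = truth.size
    total_correct = n - int(wrong_sorted.sum())
    # After rejecting the k least confident samples (k = 0..n):
    wrong_rejected = np.concatenate([[0], np.cumsum(wrong_sorted)])
    correct_rejected = np.arange(n + 1) - wrong_rejected
    quality = (total_correct - correct_rejected + wrong_rejected) / n
    k_best = int(np.argmax(quality))
    rejected = np.zeros(n, dtype=bool)
    rejected[order[:k_best]] = True
    return k_best / n, float(quality[k_best]), rejected
```

When the classifier is already accurate and its errors are not concentrated among the least confident samples, the sweep returns a rejected fraction of 0.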

Comparing the performance results of classification with rejection using contextual information (white
background in Tab. I) with the results of classification with context only (green background in Tab. I), the
improvement in classification accuracy at the expense of introducing rejection is clear. For images 1 and
2, this can be achieved at levels of classification quality higher than the accuracy of context only, meaning
that we are rejecting misclassified samples at a proportion that increases the number of correct decisions
made (the underlying concept of classification quality). For image 3, due to the high accuracy of context
only (and of the classification with no context and no rejection, brown background in Tab. I), the increase
in accuracy is at the expense of rejecting a comparatively large proportion of correctly classified samples,
leading to a smaller value of classification quality.

(a) Original image. (b) Ground truth. (c) Classification result.

Fig. 8. Example of classification results for H&E stained samples of teratoma imaged at 40X containing multiple tissues: Image 1 (first row):
background (light pink), smooth muscle (dark pink), gastrointestinal (purple), mature neuroglial (light brown), fat (dark brown), mesenchyme
(light blue); Image 2 (second row): background (light pink), fat (dark brown), mesenchyme (light blue), skin (green); Image 3 (third row):
mesenchyme (light green), bone (dark blue). Rejected partitions are shown in black. The training set consists of: 5 randomly chosen partitions
per class (roughly 0.6% of total) for image 1, 120 randomly chosen partitions (roughly 3% of total) for image 2, 20 randomly chosen
partitions (roughly 0.5% of total) for image 3; with the $\lambda$ parameter set to 5.

Comparing the performance results of classification with rejection using contextual information with the
results of classification with rejection only with optimal rejected fraction (red background in Tab. I), the
results are comparable for images 1 and 2, meaning we can achieve a performance improvement similar
to that achieved by rejection with optimal rejected fraction through the introduction of context. For image
3, due to the high accuracy of classification with no context and no rejection (brown background in Tab.
I), the optimal rejected fraction is 0, meaning that the increased accuracy is at the expense of rejecting a
comparatively large proportion of correctly classified samples.

Analyzing the classification in Fig. 8, the effects of combining rejection with contextual information
are clear. We obtain significant improvements for image 1 by combining classification with context with
classification with rejection in terms of classification quality and nonrejected accuracy, thus revealing the
potential of combining classification with rejection with classification with context. For image 2, only the
class boundaries are rejected, leading to high values of overall rejection quality and class-specific rejection
quality. In image 3, the effect of noisy training sets (due to the image characteristics) is clear:
a significant amount of the class boundaries is rejected, and the classification quality is lower than the
accuracy of the original classification with no context and no rejection.

Finally, we point to the usefulness of the classification quality $Q$. By analysis of the classification
quality, it is possible to compare the performance of the classifier with rejection in different situations
and note how the performance will decrease as the complexity of the problem increases (by increasing
the number of classes).


VII. Conclusions 

We proposed a classifier that, by combining classification with rejection with classification using
contextual information, is able to increase classification accuracy. Furthermore, we are able to impose
spatial constraints on the rejection itself, departing from the current standard of image classification with
rejection. These encouraging results point towards potential application of this method in large-scale
automated tissue identification systems of histological slices as well as other classification tasks.